
    The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

    Collections of Web documents about specific topics are needed for many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs). These are also used to describe the expected topic of the collection. The choice of seed URLs influences the quality of the resulting collection and requires a lot of expertise. In this demonstration we present the iCrawl Wizard, a tool that assists users in defining focused crawls efficiently and semi-automatically. Our tool uses major search engines and Social Media APIs as well as information extraction techniques to find seed URLs and a semantic description of the crawl intent. Using the iCrawl Wizard even non-expert users can create semantic specifications for focused crawlers interactively and efficiently. Comment: Published in the Proceedings of the European Conference on Information Retrieval (ECIR) 201
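    The abstract describes assembling seed URLs from search-engine and Social Media API results. As a minimal illustrative sketch (not the iCrawl Wizard's actual algorithm, which the abstract does not detail), one plausible step is deduplicating ranked result URLs by host so the crawl does not start from a single site; the function and example URLs below are hypothetical:

```python
from urllib.parse import urlparse

def select_seed_urls(candidates, max_per_domain=1):
    """Pick a diverse seed set from relevance-ranked result URLs:
    keep at most `max_per_domain` URLs per host so the focused
    crawl does not start from a single site."""
    seeds, per_domain = [], {}
    for url in candidates:  # assumed ordered by relevance
        host = urlparse(url).netloc
        if per_domain.get(host, 0) < max_per_domain:
            seeds.append(url)
            per_domain[host] = per_domain.get(host, 0) + 1
    return seeds

# Hypothetical search results for a crawl about the Fukushima disaster.
results = [
    "https://example.org/fukushima",
    "https://example.org/fukushima/timeline",
    "https://news.example.com/2011/disaster",
]
print(select_seed_urls(results))
# ['https://example.org/fukushima', 'https://news.example.com/2011/disaster']
```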

    Analyzing web archives through topic and event focused sub-collections

    Web archives capture the history of the Web and are therefore an important source to study how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose the research methodology of extracting and studying sub-collections of the archive focused on specific topics and events. We discuss the opportunities and challenges of this approach and suggest a framework for creating sub-collections.

    Should I Care about Your Opinion? Detection of Opinion Interestingness and Dynamics in Social Media

    In this paper, we describe a set of reusable text processing components for extracting opinionated information from social media, rating it for interestingness, and for detecting opinion events. We have developed applications in GATE to extract named entities, terms and events and to detect opinions about them, which are then used as the starting point for opinion event detection. The opinions are then aggregated over larger sections of text, to give some overall sentiment about topics and documents, and also some degree of information about interestingness based on opinion diversity. We go beyond traditional opinion mining techniques in a number of ways: by focusing on specific opinion-target extraction related to key terms and events, by examining and dealing with a number of specific linguistic phenomena, by analysing and visualising opinion dynamics over time, and by aggregating the opinions in different ways for a more flexible view of the information contained in the documents. EU/27023
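    The abstract mentions aggregating opinions and rating interestingness by opinion diversity, without giving formulas. A minimal sketch of one plausible realisation, assuming per-mention polarity scores in [-1, 1]: overall sentiment as the mean score, and interestingness as the normalised entropy of the positive/neutral/negative label distribution (higher entropy = more diverse opinions). All names and the entropy choice are this sketch's assumptions, not the paper's method:

```python
import math
from collections import Counter

def aggregate_opinions(opinions):
    """Aggregate per-mention polarity scores (-1..1) for one topic.
    Returns (mean sentiment, diversity-based interestingness), where
    interestingness is the entropy of the pos/neu/neg distribution,
    normalised to [0, 1] by using log base 3 (three classes)."""
    labels = ["neg" if s < 0 else "pos" if s > 0 else "neu" for s in opinions]
    counts = Counter(labels)
    n = len(opinions)
    entropy = -sum((c / n) * math.log(c / n, 3) for c in counts.values())
    return sum(opinions) / n, entropy

# Mixed opinions yield near-neutral sentiment but high interestingness.
mean, interest = aggregate_opinions([0.8, -0.6, 0.4, -0.9, 0.1])
```

Unanimous opinions give entropy 0 (uninteresting); an even split across all three labels gives 1.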

    Evaluation of Methods and Techniques for Language Based Sentiment Analysis for DAX 30 Stock Exchange: A First Concept of a "LUGO" Sentiment Indicator

    Social media companies are famous for building communities, or for taking their companies public in IPOs. However, social media are also used in stock exchange trading and for the product promotion of securities offered by financial investment companies. Stock exchange trading in particular is often driven by sentiment, that is, by fast-spreading rumors and news. Within the scope of this publication, we evaluate potential methods and techniques for language-based sentiment analysis for the purpose of stock exchange trading, and examine a possible technique for deriving a technical indicator from social media that could support investment decisions. We present a basic experimental setup and describe the LUGO Sentiment Indicator as a possible tool for supporting investment decisions based on social media sentiment analysis.
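    The abstract does not specify how the LUGO indicator is constructed. As a hypothetical illustration only, a sentiment-based technical indicator is often built like a price-based one, e.g. a simple moving average over daily net sentiment scores, with zero-line crossings read as signals; the function name and window choice below are this sketch's assumptions:

```python
def sentiment_indicator(daily_scores, window=3):
    """Simple moving average over daily net social-media sentiment
    scores. A crossing of the zero line could be read as a shift in
    market mood. (Hypothetical sketch; not the actual LUGO formula.)"""
    out = []
    for i in range(window - 1, len(daily_scores)):
        out.append(sum(daily_scores[i - window + 1 : i + 1]) / window)
    return out

# Five days of net sentiment produce three smoothed indicator values.
print(sentiment_indicator([0.2, -0.1, 0.4, 0.3, -0.5]))
```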

    Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

    Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. The most efficient way to access the documents within Web archives is provided through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and can offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of the named entity extraction from the URLs in the Popular German Web dataset and analyse the proportion of the archived URLs from 1,444 popular domains in the time interval from 2000 to 2012 to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provide a good starting point to efficiently annotate large scale collections of Web documents.
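    The key idea above is that URL strings themselves carry entity mentions, so initial annotations can be obtained without fetching the archived pages. A minimal sketch of that idea, assuming a toy gazetteer lookup over URL path tokens (the paper's actual extraction pipeline is not described in the abstract; a real system would use a full NER model):

```python
import re
from urllib.parse import urlparse

# Hypothetical toy gazetteer; stands in for a real NER component.
GAZETTEER = {"angela merkel": "PERSON", "berlin": "LOCATION"}

def entities_from_url(url):
    """Split the URL path into word tokens on common separators and
    look up bigrams and unigrams in the gazetteer -- a cheap way to
    get initial annotations from the index alone."""
    path = urlparse(url).path.lower()
    tokens = [t for t in re.split(r"[/\-_.]+", path) if t]
    found = []
    for n in (2, 1):  # prefer longer matches first
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i : i + n])
            if gram in GAZETTEER:
                found.append((gram, GAZETTEER[gram]))
    return found

print(entities_from_url("http://news.example.de/berlin/angela-merkel-speech.html"))
# [('angela merkel', 'PERSON'), ('berlin', 'LOCATION')]
```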

    Mouse models of breast cancer metastasis

    Metastatic spread of cancer cells is the main cause of death of breast cancer patients, and elucidation of the molecular mechanisms underlying this process is a major focus in cancer research. The identification of appropriate therapeutic targets and proof-of-concept experimentation involves an increasing number of experimental mouse models, including spontaneous and chemically induced carcinogenesis, tumor transplantation, and transgenic and/or knockout mice. Here we give a progress report on how mouse models have contributed to our understanding of the molecular processes underlying breast cancer metastasis and on how such experimentation can open new avenues to the development of innovative cancer therapy.

    Extracting event-centric document collections from large-scale web archives

    Web archives created by the Internet Archive (IA) (https://archive.org), national libraries and other archiving services contain large amounts of information collected over a time period of more than twenty years. These archives constitute a valuable source for research in many disciplines, including the digital humanities and the historical sciences, by offering a unique possibility to look into past events and their representation on the Web. Most Web archive services aim to capture the entire Web (IA) or national top-level domains and are therefore broad in their scope, diverse regarding the topics they contain and the time intervals they cover. Due to the large size and the broad scope, it is difficult for interested researchers to locate relevant information in the archives, as search facilities are very limited. Many users are more interested in studying smaller and topically coherent event-centric collections of documents contained in a Web archive [1,2]. Such collections can reflect specific events such as elections or natural disasters, e.g. the Fukushima nuclear disaster (2011) or the German federal elections.
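    Web archive captures are typically accessed through index records of timestamped URLs, so a first-cut event-centric sub-collection can be obtained by filtering the index on a time window and event keywords. The sketch below illustrates that filtering step under those assumptions (the record shape, keyword matching, and example data are hypothetical, not the authors' extraction method):

```python
def event_subcollection(cdx_records, keywords, start, end):
    """Filter (timestamp, url) index records: keep captures whose
    timestamp falls in [start, end] and whose URL mentions one of
    the event keywords. Timestamps use the 14-digit CDX-style form
    YYYYMMDDhhmmss, which sorts chronologically as a string."""
    keep = []
    for ts, url in cdx_records:
        if start <= ts <= end and any(k in url.lower() for k in keywords):
            keep.append((ts, url))
    return keep

# Hypothetical index: only the in-window, on-topic capture survives.
index = [
    ("20110312080000", "http://example.org/fukushima-reactor"),
    ("20090101000000", "http://example.org/fukushima-history"),
    ("20110315120000", "http://example.org/weather"),
]
print(event_subcollection(index, ["fukushima"],
                          "20110311000000", "20111231235959"))
# [('20110312080000', 'http://example.org/fukushima-reactor')]
```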